Outlier Detection

Running ./bin/parse.sh will create the source file for outlier detection: ./data/tirol_obituaries_deduped_weekly_outlier_detection_features.csv.

The columns of this file are: district,municipaly,year,week,count,yearly_max,weekly_max

yearly_max and weekly_max are the derived features. Each is the difference between a row's count and a maximum for the same municipality: yearly_max uses the municipality's yearly maximum, and weekly_max uses the municipality's maximum for the row's week.
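The derivation can be sketched in a few lines of Python. This is a minimal illustration, not the code in ./bin/parse.sh. It assumes that the yearly maximum is taken over all weeks of the row's year, that the weekly maximum is taken over the same week number across all years, and that the difference is computed as count minus maximum (so 0 marks a peak row). The function name add_max_features and the sample rows are made up for the example.

```python
def add_max_features(rows):
    """Return copies of rows with yearly_max and weekly_max added.

    Assumed interpretation (not confirmed by parse.sh):
      yearly_max = count - max count of that municipality in the row's year
      weekly_max = count - max count of that municipality in the row's week
                   number, across all years
    """
    yearly_peak = {}
    weekly_peak = {}
    for r in rows:
        place = (r["district"], r["municipaly"])
        y_key = place + (r["year"],)
        w_key = place + (r["week"],)
        yearly_peak[y_key] = max(yearly_peak.get(y_key, 0), r["count"])
        weekly_peak[w_key] = max(weekly_peak.get(w_key, 0), r["count"])

    result = []
    for r in rows:
        place = (r["district"], r["municipaly"])
        enriched = dict(r)
        enriched["yearly_max"] = r["count"] - yearly_peak[place + (r["year"],)]
        enriched["weekly_max"] = r["count"] - weekly_peak[place + (r["week"],)]
        result.append(enriched)
    return result


# Made-up sample rows, only to show the shape of the data.
sample_rows = [
    {"district": "D1", "municipaly": "M1", "year": 2019, "week": 1, "count": 2},
    {"district": "D1", "municipaly": "M1", "year": 2019, "week": 2, "count": 5},
    {"district": "D1", "municipaly": "M1", "year": 2020, "week": 1, "count": 4},
]
features = add_max_features(sample_rows)
for row in features:
    print(row["year"], row["week"], row["count"], row["yearly_max"], row["weekly_max"])
```

With this sign convention both features are zero or negative. Because the configuration below enables standardization, flipping the sign should not change which rows stand out.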

To calculate the outlier_score for each row, use the following configuration with Elasticsearch's Outlier Detection feature:

{
  "id": "tirol_outlier_high_count_2_1",
  "description": "",
  "source": {
    "index": [
      "tirol_outlier_high_count_2"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "tirol_outlier_high_count_dest_2_1",
    "results_field": "ml"
  },
  "analysis": {
    "outlier_detection": {
      "compute_feature_influence": true,
      "outlier_fraction": 0.05,
      "standardization_enabled": true
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": [
      "count",
      "week",
      "year"
    ]
  },
  "model_memory_limit": "2mb",
  "create_time": 1587934690989,
  "version": "8.0.0",
  "allow_lazy_start": false
}
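The configuration above looks like the output of a GET request: create_time and version are filled in by Elasticsearch, and to my knowledge the create endpoint does not accept them. The sketch below shows one way to turn such a config into the two requests needed to create and start the job. The endpoints PUT _ml/data_frame/analytics/<id> and POST _ml/data_frame/analytics/<id>/_start are part of the Elasticsearch data frame analytics API; the helper names build_create_request and build_start_request are hypothetical, and the example only builds the requests without sending them.

```python
import json

# Fields assumed to be server-populated (create_time, version) or carried
# in the URL (id), so they are dropped from the request body.
NON_BODY_FIELDS = ("id", "create_time", "version")


def build_create_request(config):
    """Return (method, path, body) for creating the analytics job."""
    job_id = config["id"]
    body = {k: v for k, v in config.items() if k not in NON_BODY_FIELDS}
    return "PUT", "/_ml/data_frame/analytics/" + job_id, json.dumps(body)


def build_start_request(job_id):
    """Return (method, path) for starting a previously created job."""
    return "POST", "/_ml/data_frame/analytics/" + job_id + "/_start"


# Trimmed copy of the configuration above, for illustration.
config = {
    "id": "tirol_outlier_high_count_2_1",
    "source": {"index": ["tirol_outlier_high_count_2"]},
    "dest": {"index": "tirol_outlier_high_count_dest_2_1", "results_field": "ml"},
    "analysis": {"outlier_detection": {"outlier_fraction": 0.05}},
    "create_time": 1587934690989,
    "version": "8.0.0",
}

method, path, body = build_create_request(config)
start_method, start_path = build_start_request(config["id"])
print(method, path)
print(start_method, start_path)
```

Once the job has finished, each document in the destination index should carry its score under the results field, here ml.outlier_score.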

To learn more about the feature, see the Elasticsearch documentation on outlier detection.